BAM | 1000 Genomes

Are there any scripts or APIs for use with the 1000 Genomes data sets?

Answer:

There are a number of tools available in the Tools page of the 1000 Genomes Browser.

Our data is in standard formats like SAM and VCF, which have tools associated with them. To manipulate SAM/BAM files, look at SAMtools, a C-based toolkit with links to APIs in other languages. To interact with VCF files, look at VCFtools, which is a set of Perl and C++ code.

We also provide a public MySQL instance with copies of the databases behind the 1000 Genomes Ensembl browsers. These databases are described on our public instance page.

How are your alignments generated?

Answer:

The 1000 Genomes Project has used several different alignment algorithms over its lifetime:

Project stage	Sequencing technology	Alignment algorithm
Pilot	Illumina	MAQ
Pilot	SOLiD	Corona lite
Pilot	454	ssaha
Main	Illumina	BWA
Main	SOLiD	BFAST
Main	454	ssaha (first set)
Main	454	smalt (final set)

The full process is described in the README.

How do I get a sub-section of a BAM file?

Answer:

There are two ways to get subsections of our BAM files.

The first is to use the Data Slicer tool from our browser, which is documented here. This tool gives you a web interface that asks for the URL of any BAM file and the genomic location you want a slice of. This tool also works for VCF files.

The second is to use samtools on the command line, e.g.

samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 17:7512445-7513455

Samtools supports streaming files and piping commands together, using both local and remote files. You can get more help with samtools from the samtools help mailing list.
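If you need to slice many regions or files, the command above can be assembled in a script. The following is a minimal Python sketch, not an official tool: the function name build_samtools_slice_command is hypothetical, and it only builds the argument list for the samtools view call shown above. Actually running it assumes samtools is installed and the FTP site is reachable.

```python
import shlex


def build_samtools_slice_command(bam_url, chrom, start, end, include_header=True):
    """Return the argument list for `samtools view` restricted to one region.

    Hypothetical helper for illustration; coordinates are 1-based and inclusive,
    matching the chrom:start-end region syntax used in the example above.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    region = f"{chrom}:{start}-{end}"
    command = ["samtools", "view"]
    if include_header:
        command.append("-h")  # keep the SAM header in the output
    command.extend([bam_url, region])
    return command


BAM_URL = (
    "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/"
    "HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
)

command = build_samtools_slice_command(BAM_URL, "17", 7512445, 7513455)
print(shlex.join(command))

# To execute the slice (requires samtools on PATH and network access):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a file name or region ever contains unusual characters.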

What are CRAM files?

Answer:

CRAM files are alignment files like BAM files. They represent a compressed version of the alignment. This compression is driven by the reference the sequence data is aligned to.

The file format was designed by the EBI to reduce the disk footprint of alignment data in these days of ever-increasing data volumes.

The CRAM files the 1000 Genomes Project distributes are lossy: they reduce the base quality scores using the Illumina 8-bin compression scheme, as described in the lossy compression section on the CRAM usage page.
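To give a feel for what quality binning does, here is a minimal Python sketch. The bin boundaries and representative values are an assumption taken from Illumina's published 8-bin scheme and should be checked against the CRAM usage page; the names BINS, bin_quality and bin_phred33_string are hypothetical, and scores below 2 (the no-call range) are simply left unchanged here.

```python
# (lowest score, highest score, representative score) for each bin.
# ASSUMPTION: boundaries follow Illumina's published 8-bin scheme.
BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
]


def bin_quality(score):
    """Map one Phred quality score to the representative value of its bin."""
    if score < 0:
        raise ValueError("quality scores cannot be negative")
    if score >= 40:
        return 40
    for low, high, representative in BINS:
        if low <= score <= high:
            return representative
    return score  # 0 or 1: treated as no-call and left unchanged in this sketch


def bin_phred33_string(qualities):
    """Apply the binning to a Phred+33 encoded quality string."""
    return "".join(chr(bin_quality(ord(ch) - 33) + 33) for ch in qualities)


binned = bin_phred33_string("I5#")
print(binned)
```

Because many distinct scores collapse onto a handful of values, the quality strings compress much better, at the cost of losing the original scores.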

There is a cram developers mailing list where the format is discussed and help can be found.

CRAM files can be read using many Picard tools and work is being done to ensure samtools can also read the file format natively.

What are the unmapped bams?

Answer:

The unmapped bams contain all the reads for the given individual which could not be placed on the reference genome. They contain no mapping information.

Please note that for any paired-end sequence where one end successfully maps but the other does not, both reads are found in the mapped bam.
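The rule above can be expressed in terms of the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The following Python sketch is one reading of that rule, not the project's actual pipeline code; the function name destination_bam is hypothetical.

```python
FLAG_PAIRED = 0x1         # read is part of a pair
FLAG_UNMAPPED = 0x4       # this read is unmapped
FLAG_MATE_UNMAPPED = 0x8  # the read's mate is unmapped


def destination_bam(flag):
    """Return "mapped" or "unmapped" for a read, given its SAM flag."""
    if not flag & FLAG_UNMAPPED:
        return "mapped"  # the read itself aligned
    if flag & FLAG_PAIRED and not flag & FLAG_MATE_UNMAPPED:
        return "mapped"  # unmapped read kept alongside its mapped mate
    return "unmapped"    # unpaired, or neither end of the pair aligned


for example_flag in (99, 133, 77, 4):
    print(example_flag, destination_bam(example_flag))
```

Flag 133, for example, describes an unmapped second-in-pair read whose mate did map, so under this rule it stays in the mapped bam.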

What is a bas file?

Answer:

Bas files contain statistics that we generate for our alignment files and distribute alongside them.

These are readgroup-level statistics in tab-delimited format and are described in this README.

Each mapped and unmapped bam file has an associated bas file and we provide them collected together into a single file in the alignment_indices directory, dated to match the alignment release.
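Since bas files are tab-delimited with one row per readgroup, they can be read with any standard table parser. The Python sketch below assumes the file has a header row; the column names readgroup, total_reads and mapped_reads, and the two sample rows, are hypothetical placeholders for illustration only. The real column names are listed in the README.

```python
import csv
import io

# HYPOTHETICAL sample content: invented column names and numbers,
# standing in for a real bas file.
SAMPLE_BAS = (
    "readgroup\ttotal_reads\tmapped_reads\n"
    "RG1\t1000\t950\n"
    "RG2\t2000\t1800\n"
)


def read_bas(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def mapped_fraction_by_readgroup(rows):
    """Return {readgroup: mapped_reads / total_reads} for each row."""
    return {
        row["readgroup"]: int(row["mapped_reads"]) / int(row["total_reads"])
        for row in rows
    }


rows = read_bas(io.StringIO(SAMPLE_BAS))
fractions = mapped_fraction_by_readgroup(rows)
print(fractions)
```

For a real file, replace io.StringIO(SAMPLE_BAS) with open(path, newline="") and adjust the column names to match the README.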

What is the Data Slicer?

Answer:

The Data Slicer is a web based tool in our browser which allows you to get subsections of our indexed VCF and BAM files.

What format are your alignments in and what do the names mean?

Answer:

All our alignment files are in BAM format, a standard alignment format which was defined by the consortium and has since seen wide community adoption. We also provide our alignments in CRAM format.

The bam file names look like:

NA00000.location.platform.population.analysis_group.YYYYMMDD.bam

The bai index and bas statistics files are also named in the same way.

The name includes, in order: the individual sample ID; where the sequence is mapped to; the sequencing platform; the population the sample belongs to, using our three-letter population code; and the sequencing strategy (analysis group). For the location, if the file contains only mappings to a particular chromosome, the name contains that chromosome. Otherwise, mapped means the whole-genome mapping and unmapped means the reads which failed to map to the reference (pairs where one mate mapped and the other didn't stay in the mapped file). The date matches the date of the sequence used to build the bams and can also be found in the sequence.index filename.
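The naming pattern can be split into its fields programmatically. The Python sketch below is an illustration rather than an official parser; the names BamName and parse_bam_name are hypothetical. Note that the example file name in the samtools command on this page carries an extra field (bwa) between platform and population, so the sketch assumes that a seven-field name includes the alignment algorithm in that position.

```python
from datetime import date, datetime
from typing import NamedTuple, Optional


class BamName(NamedTuple):
    sample: str
    location: str           # "mapped", "unmapped", or a chromosome label
    platform: str
    aligner: Optional[str]  # ASSUMPTION: present only in seven-field names
    population: str
    analysis_group: str
    date: date


def parse_bam_name(filename):
    """Split a 1000 Genomes style BAM file name or URL into its fields."""
    base = filename.rsplit("/", 1)[-1]
    parts = base.split(".")
    if parts[-1] != "bam":
        raise ValueError(f"not a .bam file name: {base}")
    fields = parts[:-1]
    if len(fields) == 6:
        sample, location, platform, population, group, stamp = fields
        aligner = None
    elif len(fields) == 7:
        sample, location, platform, aligner, population, group, stamp = fields
    else:
        raise ValueError(f"unexpected number of fields in: {base}")
    parsed_date = datetime.strptime(stamp, "%Y%m%d").date()
    return BamName(sample, location, platform, aligner, population, group, parsed_date)


info = parse_bam_name("HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam")
print(info)
```

The same split applies to the bai and bas files after removing their extra suffix, since they are named in the same way.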

Where are your alignment files located?

Answer:

Our main alignment files are located in our data directory. Our mapped bams contain reads which aligned to the whole genome.

You can find an index of our alignments in our alignment.index file. There are dated versions of these files and statistics surrounding each alignment release in the alignment_indices directory. Please note that, with few exceptions, we only keep the most recent QC-passed alignment for each sample on the ftp site.

We also have frozen versions of the alignments used for both the pilot and the phase 1 analyses in different directories on the ftp site. Please note that the pilot alignments are mapped to NCBI36 rather than GRCh37 like all other alignments on the ftp site.

Where can I find phase3 alignment BAM files and read fastq files on the ftp site?

Answer:

You can find all the 1000 Genomes phase 3 BAM and fastq files in:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data

All BAM files from IGSR can be found in:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections

Why are there chr 11 and chr 20 alignment files, and not for other chromosomes?

Answer:

The chr 11 and chr 20 alignment files are put in place to give the 1000 Genomes analysis group a small section of the genome to run test analyses on before committing to a particular strategy to run across the whole genome. Everything in the chr 11 and chr 20 files is also represented in the mapped bam file. To get a complete view of what data we aligned, you only need to download the mapped and unmapped bams; the chr 11 and chr 20 bams are there as a convenience to the analysis group.

IGSR: The International Genome Sample Resource

Supporting open human variation data
